Problem Statement¶


Business Context¶

Understanding customer personality and behavior is pivotal for businesses to enhance customer satisfaction and increase revenue. Segmentation based on a customer's personality, demographics, and purchasing behavior allows companies to create tailored marketing campaigns, improve customer retention, and optimize product offerings.

A leading retail company with a rapidly growing customer base seeks to gain deeper insights into their customers' profiles. The company recognizes that understanding customer personalities, lifestyles, and purchasing habits can unlock significant opportunities for personalizing marketing strategies and creating loyalty programs. These insights can help address critical business challenges, such as improving the effectiveness of marketing campaigns, identifying high-value customer groups, and fostering long-term relationships with customers.

With the competition intensifying in the retail space, moving away from generic strategies to more targeted and personalized approaches is essential for sustaining a competitive edge.


Objective¶

In an effort to optimize marketing efficiency and enhance customer experience, the company has embarked on a mission to identify distinct customer segments. By understanding the characteristics, preferences, and behaviors of each group, the company aims to:

  1. Develop personalized marketing campaigns to increase conversion rates.
  2. Create effective retention strategies for high-value customers.
  3. Optimize resource allocation, such as inventory management, pricing strategies, and store layouts.

As a data scientist tasked with this project, your responsibility is to analyze the given customer data, apply machine learning techniques to segment the customer base, and provide actionable insights into the characteristics of each segment.


Data Dictionary¶

The dataset includes historical data on customer demographics, personality traits, and purchasing behaviors. Key attributes are:

  1. Customer Information

    • ID: Unique identifier for each customer.
    • Year_Birth: Customer's year of birth.
    • Education: Education level of the customer.
    • Marital_Status: Marital status of the customer.
    • Income: Yearly household income (in dollars).
    • Kidhome: Number of children in the household.
    • Teenhome: Number of teenagers in the household.
    • Dt_Customer: Date when the customer enrolled with the company.
    • Recency: Number of days since the customer’s last purchase.
    • Complain: Whether the customer complained in the last 2 years (1 for yes, 0 for no).
  2. Spending Information (Last 2 Years)

    • MntWines: Amount spent on wine.
    • MntFruits: Amount spent on fruits.
    • MntMeatProducts: Amount spent on meat.
    • MntFishProducts: Amount spent on fish.
    • MntSweetProducts: Amount spent on sweets.
    • MntGoldProds: Amount spent on gold products.
  3. Purchase and Campaign Interaction

    • NumDealsPurchases: Number of purchases made using a discount.
    • AcceptedCmp1: Response to the 1st campaign (1 for yes, 0 for no).
    • AcceptedCmp2: Response to the 2nd campaign (1 for yes, 0 for no).
    • AcceptedCmp3: Response to the 3rd campaign (1 for yes, 0 for no).
    • AcceptedCmp4: Response to the 4th campaign (1 for yes, 0 for no).
    • AcceptedCmp5: Response to the 5th campaign (1 for yes, 0 for no).
    • Response: Response to the last campaign (1 for yes, 0 for no).
  4. Shopping Behavior

    • NumWebPurchases: Number of purchases made through the company’s website.
    • NumCatalogPurchases: Number of purchases made using catalogs.
    • NumStorePurchases: Number of purchases made directly in stores.
    • NumWebVisitsMonth: Number of visits to the company’s website in the last month.

Problem Definition¶

The company operates in retail and wants to segment its growing customer base on personality traits, demographics, and buying behavior in order to improve marketing effectiveness and customer experience. Traditional one-size-fits-all marketing no longer works; data-driven approaches are needed to personalize customer interactions. The company will use machine learning techniques to identify distinct groups of customers for targeted campaigns, better retention strategies, and efficient resource allocation. The ultimate goal is to enhance customer satisfaction, increase revenue, and thereby stay competitive in a changing retail environment.

Importing necessary libraries¶

In [1]:
# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np

# libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)

# to scale the data using z-score
from sklearn.preprocessing import StandardScaler

# to compute distances
from scipy.spatial.distance import cdist, pdist

# to perform k-means clustering and compute silhouette scores
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# to visualize the elbow curve and silhouette scores
from yellowbrick.cluster import KElbowVisualizer, SilhouetteVisualizer

# to perform hierarchical clustering, compute cophenetic correlation, and create dendrograms
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage, cophenet

# to suppress warnings
import warnings

warnings.filterwarnings("ignore")

Loading the data¶

In [2]:
# uncomment and run the following line if using Google Colab
# from google.colab import drive
# drive.mount('/content/drive')
In [3]:
# Mount Google Drive so the notebook can access the dataset
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
In [4]:
# loading data into a pandas dataframe
data = pd.read_csv("/content/drive/My Drive/marketing_campaign.csv", sep="\t")

Data Overview¶

In [5]:
# Code to check the first 5 rows of the dataset
data.head()
Out[5]:
ID Year_Birth Education Marital_Status Income Kidhome Teenhome Dt_Customer Recency MntWines MntFruits MntMeatProducts MntFishProducts MntSweetProducts MntGoldProds NumDealsPurchases NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2 Complain Z_CostContact Z_Revenue Response
0 5524 1957 Graduation Single 58138.0 0 0 04-09-2012 58 635 88 546 172 88 88 3 8 10 4 7 0 0 0 0 0 0 3 11 1
1 2174 1954 Graduation Single 46344.0 1 1 08-03-2014 38 11 1 6 2 1 6 2 1 1 2 5 0 0 0 0 0 0 3 11 0
2 4141 1965 Graduation Together 71613.0 0 0 21-08-2013 26 426 49 127 111 21 42 1 8 2 10 4 0 0 0 0 0 0 3 11 0
3 6182 1984 Graduation Together 26646.0 1 0 10-02-2014 26 11 4 20 10 3 5 2 2 0 4 6 0 0 0 0 0 0 3 11 0
4 5324 1981 PhD Married 58293.0 1 0 19-01-2014 94 173 43 118 46 27 15 5 5 3 6 5 0 0 0 0 0 0 3 11 0
In [6]:
# Code to check the last 5 rows of the dataset
data.tail()
Out[6]:
ID Year_Birth Education Marital_Status Income Kidhome Teenhome Dt_Customer Recency MntWines MntFruits MntMeatProducts MntFishProducts MntSweetProducts MntGoldProds NumDealsPurchases NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2 Complain Z_CostContact Z_Revenue Response
2235 10870 1967 Graduation Married 61223.0 0 1 13-06-2013 46 709 43 182 42 118 247 2 9 3 4 5 0 0 0 0 0 0 3 11 0
2236 4001 1946 PhD Together 64014.0 2 1 10-06-2014 56 406 0 30 0 0 8 7 8 2 5 7 0 0 0 1 0 0 3 11 0
2237 7270 1981 Graduation Divorced 56981.0 0 0 25-01-2014 91 908 48 217 32 12 24 1 2 3 13 6 0 1 0 0 0 0 3 11 0
2238 8235 1956 Master Together 69245.0 0 1 24-01-2014 8 428 30 214 80 30 61 2 6 5 10 3 0 0 0 0 0 0 3 11 0
2239 9405 1954 PhD Married 52869.0 1 1 15-10-2012 40 84 3 61 2 1 21 3 3 1 4 7 0 0 0 0 0 0 3 11 1

Observations

The data has been loaded correctly. We can proceed to perform further analysis on the dataset.

In [7]:
# keeping a backup copy of the raw data so the original load stays untouched
raw_data = data.copy()
In [8]:
# Code to check the shape of the dataset
num_rows, num_cols = data.shape

# Adding narrative to the output
print(f"The dataset has {num_rows} rows and {num_cols} columns.")
print(f"This means there are {num_rows} observations and {num_cols} features in the data.")
The dataset has 2240 rows and 29 columns.
This means there are 2240 observations and 29 features in the data.

Calculating the age of the customer from the "Year_Birth" column¶

In [9]:
from datetime import datetime

# Get the current year
current_year = datetime.now().year

# Calculate the age of the customer
data['Age'] = current_year - data['Year_Birth']
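
Note that Age depends on the year the notebook is run, so the computed ages drift over time. A reproducible alternative (a sketch only, not applied here; the 2014 reference year is an assumption based on the latest enrollment dates in Dt_Customer) would be:

# pin a fixed reference year instead of datetime.now() for reproducibility
# REFERENCE_YEAR = 2014  # assumption: roughly the last enrollment year in the data
# data['Age'] = REFERENCE_YEAR - data['Year_Birth']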

Question 1: What are the data types of all the columns?¶

In [10]:
# Code to check the data types of the columns in the dataset
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2240 entries, 0 to 2239
Data columns (total 30 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   ID                   2240 non-null   int64  
 1   Year_Birth           2240 non-null   int64  
 2   Education            2240 non-null   object 
 3   Marital_Status       2240 non-null   object 
 4   Income               2216 non-null   float64
 5   Kidhome              2240 non-null   int64  
 6   Teenhome             2240 non-null   int64  
 7   Dt_Customer          2240 non-null   object 
 8   Recency              2240 non-null   int64  
 9   MntWines             2240 non-null   int64  
 10  MntFruits            2240 non-null   int64  
 11  MntMeatProducts      2240 non-null   int64  
 12  MntFishProducts      2240 non-null   int64  
 13  MntSweetProducts     2240 non-null   int64  
 14  MntGoldProds         2240 non-null   int64  
 15  NumDealsPurchases    2240 non-null   int64  
 16  NumWebPurchases      2240 non-null   int64  
 17  NumCatalogPurchases  2240 non-null   int64  
 18  NumStorePurchases    2240 non-null   int64  
 19  NumWebVisitsMonth    2240 non-null   int64  
 20  AcceptedCmp3         2240 non-null   int64  
 21  AcceptedCmp4         2240 non-null   int64  
 22  AcceptedCmp5         2240 non-null   int64  
 23  AcceptedCmp1         2240 non-null   int64  
 24  AcceptedCmp2         2240 non-null   int64  
 25  Complain             2240 non-null   int64  
 26  Z_CostContact        2240 non-null   int64  
 27  Z_Revenue            2240 non-null   int64  
 28  Response             2240 non-null   int64  
 29  Age                  2240 non-null   int64  
dtypes: float64(1), int64(26), object(3)
memory usage: 525.1+ KB
Observations:¶
  • The dataset has 1 float column (Income), 26 integer columns, and 3 object (string) columns, with a memory footprint of about 525.1 KB.

Question 2: Check the statistical summary of the data. What is the average household income?¶

In [11]:
# Code to check the statistical summary of the dataset
data.describe().T
Out[11]:
count mean std min 25% 50% 75% max
ID 2240.0 5592.159821 3246.662198 0.0 2828.25 5458.5 8427.75 11191.0
Year_Birth 2240.0 1968.805804 11.984069 1893.0 1959.00 1970.0 1977.00 1996.0
Income 2216.0 52247.251354 25173.076661 1730.0 35303.00 51381.5 68522.00 666666.0
Kidhome 2240.0 0.444196 0.538398 0.0 0.00 0.0 1.00 2.0
Teenhome 2240.0 0.506250 0.544538 0.0 0.00 0.0 1.00 2.0
Recency 2240.0 49.109375 28.962453 0.0 24.00 49.0 74.00 99.0
MntWines 2240.0 303.935714 336.597393 0.0 23.75 173.5 504.25 1493.0
MntFruits 2240.0 26.302232 39.773434 0.0 1.00 8.0 33.00 199.0
MntMeatProducts 2240.0 166.950000 225.715373 0.0 16.00 67.0 232.00 1725.0
MntFishProducts 2240.0 37.525446 54.628979 0.0 3.00 12.0 50.00 259.0
MntSweetProducts 2240.0 27.062946 41.280498 0.0 1.00 8.0 33.00 263.0
MntGoldProds 2240.0 44.021875 52.167439 0.0 9.00 24.0 56.00 362.0
NumDealsPurchases 2240.0 2.325000 1.932238 0.0 1.00 2.0 3.00 15.0
NumWebPurchases 2240.0 4.084821 2.778714 0.0 2.00 4.0 6.00 27.0
NumCatalogPurchases 2240.0 2.662054 2.923101 0.0 0.00 2.0 4.00 28.0
NumStorePurchases 2240.0 5.790179 3.250958 0.0 3.00 5.0 8.00 13.0
NumWebVisitsMonth 2240.0 5.316518 2.426645 0.0 3.00 6.0 7.00 20.0
AcceptedCmp3 2240.0 0.072768 0.259813 0.0 0.00 0.0 0.00 1.0
AcceptedCmp4 2240.0 0.074554 0.262728 0.0 0.00 0.0 0.00 1.0
AcceptedCmp5 2240.0 0.072768 0.259813 0.0 0.00 0.0 0.00 1.0
AcceptedCmp1 2240.0 0.064286 0.245316 0.0 0.00 0.0 0.00 1.0
AcceptedCmp2 2240.0 0.013393 0.114976 0.0 0.00 0.0 0.00 1.0
Complain 2240.0 0.009375 0.096391 0.0 0.00 0.0 0.00 1.0
Z_CostContact 2240.0 3.000000 0.000000 3.0 3.00 3.0 3.00 3.0
Z_Revenue 2240.0 11.000000 0.000000 11.0 11.00 11.0 11.00 11.0
Response 2240.0 0.149107 0.356274 0.0 0.00 0.0 0.00 1.0
Age 2240.0 56.194196 11.984069 29.0 48.00 55.0 66.00 132.0
Observations:¶

The average household income is $52,247.25 (computed over the 2,216 non-missing Income values).
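
A one-line sanity check of this figure (pandas skips the 24 missing Income values by default):

# verify the mean household income directly; NaNs are excluded automatically
print(f"Average household income: ${data['Income'].mean():,.2f}")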

Question 3: Are there any missing values in the data? If yes, treat them using an appropriate method¶

In [12]:
# Code to check for missing values
data.isnull().sum()
Out[12]:
0
ID 0
Year_Birth 0
Education 0
Marital_Status 0
Income 24
Kidhome 0
Teenhome 0
Dt_Customer 0
Recency 0
MntWines 0
MntFruits 0
MntMeatProducts 0
MntFishProducts 0
MntSweetProducts 0
MntGoldProds 0
NumDealsPurchases 0
NumWebPurchases 0
NumCatalogPurchases 0
NumStorePurchases 0
NumWebVisitsMonth 0
AcceptedCmp3 0
AcceptedCmp4 0
AcceptedCmp5 0
AcceptedCmp1 0
AcceptedCmp2 0
Complain 0
Z_CostContact 0
Z_Revenue 0
Response 0
Age 0

Observations:¶
  • The Income column has 24 missing values (2,216 non-null out of 2,240)

We will fill the missing values in the Income column with the median.

In [13]:
data["Income"] = data["Income"].fillna(data["Income"].median()) ## Complete the code to impute the data with median
In [14]:
# checking for missing values after treatment by imputation
data.isnull().sum()
Out[14]:
0
ID 0
Year_Birth 0
Education 0
Marital_Status 0
Income 0
Kidhome 0
Teenhome 0
Dt_Customer 0
Recency 0
MntWines 0
MntFruits 0
MntMeatProducts 0
MntFishProducts 0
MntSweetProducts 0
MntGoldProds 0
NumDealsPurchases 0
NumWebPurchases 0
NumCatalogPurchases 0
NumStorePurchases 0
NumWebVisitsMonth 0
AcceptedCmp3 0
AcceptedCmp4 0
AcceptedCmp5 0
AcceptedCmp1 0
AcceptedCmp2 0
Complain 0
Z_CostContact 0
Z_Revenue 0
Response 0
Age 0

Observations

The 24 missing Income values have now been imputed.

Question 4: Are there any duplicates in the data?¶

In [15]:
# Code to check for duplicates
data.duplicated().sum()
Out[15]:
0
Observations:¶
  • The dataset does not have any duplicates

Dropping columns which are irrelevant to our analysis.¶

In [16]:
columns_to_drop = ['Dt_Customer','Year_Birth','ID','AcceptedCmp1', 'Z_CostContact', 'Z_Revenue', 'AcceptedCmp2', 'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'Education', 'Marital_Status']
data.drop(columns=columns_to_drop, inplace=True)

Exploratory Data Analysis¶

Univariate Analysis¶

Question 5: Explore all the variables and provide observations on their distributions (histograms and boxplots)¶

In [17]:
# Code to check the shape of the dataset
num_rows, num_cols = data.shape

# Adding narrative to the output
print(f"The dataset has {num_rows} rows and {num_cols} columns.")
print(f"This means there are {num_rows} observations and {num_cols} features in the data.")
The dataset has 2240 rows and 18 columns.
This means there are 2240 observations and 18 features in the data.

Plotting the histogram of each column.¶

In [18]:
# defining the figure size
plt.figure(figsize=(12, 10))

for i, feature in enumerate(data.columns): #iterating through each column
    plt.subplot(6, 3, i+1)                  # assign a subplot in the main plot
    sns.histplot(data= data, x= feature, kde = True)    # plot the histogram

plt.tight_layout();   # to add spacing between plots
[Figure: histograms (with KDE overlays) of all 18 numerical features]

Observations

Most of the distributions are right-skewed: the bulk of the data is concentrated at the lower end of each feature's range.
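
To back this observation up numerically, we can compute the skewness of each feature (a quick sketch; values well above 0 indicate right skew):

# skewness per feature, sorted from most to least right-skewed
print(data.skew(numeric_only=True).sort_values(ascending=False))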

Plotting the boxplot of each column.¶

In [19]:
# defining the figure size (a single call; a duplicated call would create an extra empty figure)
plt.figure(figsize=(12, 10))

# plotting the boxplot for each numerical feature
for i, feature in enumerate(data.columns):    # iterating through each column
    plt.subplot(6, 3, i+1)                     # assign a subplot in the main plot
    sns.boxplot(data=data, x=feature)    # plot the boxplot

plt.tight_layout();   # to add spacing between plots
[Figure: boxplots of all 18 numerical features]

Observations

  • The above plots summarize the distribution of both discrete and continuous variables in the dataset.

  • Distribution of some features are skewed while others are somewhat symmetrical.

  • Income: The histogram is right-skewed; this is an indication that the majority of customers have lower incomes.

  • Kidhome and Teenhome: most of the customers have zero kids or teenagers staying with them.

  • Recency: This is a right-skewed distribution. There are lots of customers that have not made purchases recently.

  • MntWines, MntFruits, MntMeatProducts, MntFishProducts, MntSweetProducts, MntGoldProds: spending variables for wines and food products. All are right-skewed to varying degrees; the majority of customers spend relatively little on these items.

  • Complain: The distribution of this variable is highly right-skewed; only a few customers have complained.

  • Age: Also right-skewed, but here the skew is driven largely by a handful of implausibly old customers (birth years as early as 1893 imply ages above 120), which look like data-entry errors.

  • Some outliers can be observed in the data. These will not be treated, as they form a genuine part of the dataset.
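
As a quick sanity check on the extreme ages (a sketch; the 90-year cutoff is an arbitrary assumption):

# count customers whose computed age exceeds 90; likely data-entry errors
print((data['Age'] > 90).sum(), "customers have an age above 90")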

Bivariate Analysis¶

Question 6: Perform multivariate analysis to explore the relationships between the variables.¶

Let's check for correlations.

In [20]:
# Code to check for correlation
plt.figure(figsize=(15, 7))
sns.heatmap(data.corr(), annot=True)
plt.show()
[Figure: annotated correlation heatmap of the numerical features]

Observations:

  • Several groups of spending categories (MntWines, MntFruits, MntMeatProducts, MntFishProducts, MntSweetProducts, MntGoldProds) show strong positive correlations amongst themselves. This suggests customers who spend more on one type of product tend to spend more on others.

  • Kidhome (number of children in the household) shows moderate negative correlations with Income and most spending categories. This implies that families with more children tend to have lower incomes and spend less on these products.

  • Income plays a role in spending behavior, particularly for certain product types.

  • Household size (Kidhome) is negatively associated with income; note that this is a correlation, not a causal effect.

  • Many of the variable pairs show little to no correlation with each other.
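
To make the heatmap easier to read, the strongest pairwise correlations can be listed directly (a sketch; the 0.6 threshold is an assumption):

# keep only the upper triangle so each variable pair appears once
corr_matrix = data.corr()
mask = np.triu(np.ones(corr_matrix.shape, dtype=bool), k=1)
strong_pairs = corr_matrix.where(mask).stack().sort_values(ascending=False)
print(strong_pairs[strong_pairs.abs() > 0.6])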

Let's check for pairplots.

In [21]:
sns.pairplot(data=data, diag_kind="kde")
plt.show()
[Figure: pairplot of all numerical features with KDE diagonals]

Data Preprocessing¶

Scaling¶

  • Let's scale the data before we proceed with clustering.
In [22]:
# scaling the data before clustering
scaler = StandardScaler()
subset = data.copy()
subset_scaled = scaler.fit_transform(subset)
In [23]:
# creating a dataframe of the scaled data
subset_scaled_data = pd.DataFrame(subset_scaled, columns=subset.columns)

K-means Clustering¶

In [24]:
k_means_data = subset_scaled_data.copy() # Code will be used later in cluster profiling

Question 7: Select the appropriate number of clusters using the elbow plot. What do you think is the appropriate number of clusters?¶

In [25]:
clusters = range(2, 11)
wcss_k8 = []

for k in clusters:
    model = KMeans(n_clusters=k, random_state=1) # initialize the kmeans model with n_clusters=k
    model.fit(subset_scaled_data) # fit the kmeans model on the scaled data (subset_scaled_data)
    wcss = model.inertia_
    wcss_k8.append(wcss)

    print("Number of Clusters:", k, "\tWCSS:",wcss)

plt.plot(clusters, wcss_k8, "bo-")
plt.xlabel("k")
plt.ylabel("WCSS")
plt.title("Selecting k with the Elbow Method", fontsize=20)
plt.show()
Number of Clusters: 2 	WCSS: 29497.59180741321
Number of Clusters: 3 	WCSS: 26230.895399686044
Number of Clusters: 4 	WCSS: 24836.807962396113
Number of Clusters: 5 	WCSS: 23798.198564507387
Number of Clusters: 6 	WCSS: 22905.78476209626
Number of Clusters: 7 	WCSS: 21972.671605951775
Number of Clusters: 8 	WCSS: 19867.06721740295
Number of Clusters: 9 	WCSS: 19201.858363788044
Number of Clusters: 10 	WCSS: 18790.32967985496
[Figure: elbow plot of WCSS vs. number of clusters k]
Observations:¶
  • The WCSS curve bends most sharply between k = 2 and k = 3, so 2 or 3 clusters appears appropriate.
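
As a cross-check, the KElbowVisualizer imported at the top (but not yet used) can locate the elbow automatically; a minimal sketch with its default distortion metric:

# fits KMeans for each k in the range and annotates the detected elbow
visualizer = KElbowVisualizer(KMeans(random_state=1), k=(2, 10))
visualizer.fit(subset_scaled_data)
visualizer.show()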

Question 8: Finalize the appropriate number of clusters by checking the silhouette score as well. Is the answer different from the elbow plot?¶

In [26]:
sil_score = [] # Define an empty list to store silhouette scores
cluster_list = range(2, 10)
for n_clusters in cluster_list:
    # Initialize the kmeans model with the current value of n_clusters
    clusterer = KMeans(n_clusters=n_clusters, random_state=1)
    # Fit the kmeans model to the scaled data (k_means_data)
    preds = clusterer.fit_predict(k_means_data)
    score = silhouette_score(k_means_data, preds)  # silhouette score for this k
    sil_score.append(score)
    print("For n_clusters = {}, the silhouette score is {}".format(n_clusters, score))
For n_clusters = 2, the silhouette score is 0.28417340291067655
For n_clusters = 3, the silhouette score is 0.21188794461530194
For n_clusters = 4, the silhouette score is 0.14437407459098453
For n_clusters = 5, the silhouette score is 0.1389441529164155
For n_clusters = 6, the silhouette score is 0.1441860919328272
For n_clusters = 7, the silhouette score is 0.1466393276809718
For n_clusters = 8, the silhouette score is 0.15470127342280662
For n_clusters = 9, the silhouette score is 0.14604007088599213

Observations

n_clusters=2 has the highest silhouette score of 0.28.

In [27]:
# Find the optimal number of clusters
optimal_n_clusters = cluster_list[sil_score.index(max(sil_score))]
print("\nOptimal number of clusters:", optimal_n_clusters)
Optimal number of clusters: 2
In [28]:
# Empty dictionary to store the silhouette score for each value of k
sc = {}

# iterate for a range of Ks and fit the scaled data to the algorithm. Store the Silhouette score for that k
for k in range(2, 10):
    kmeans = KMeans(n_clusters=k, random_state=1).fit(subset_scaled_data)  # random_state pinned for reproducibility
    labels = kmeans.predict(subset_scaled_data)
    sc[k] = silhouette_score(subset_scaled_data, labels)

# Silhouette score plot
plt.figure()
plt.plot(list(sc.keys()), list(sc.values()), 'bx-')
plt.xlabel("Number of cluster")
plt.ylabel("Silhouette Score")
plt.show()
[Figure: silhouette score vs. number of clusters]

Observations

k=2 is optimal: the highest silhouette score is observed at k=2. This suggests that dividing the data into two clusters gives the best separation and cohesion within the clusters.
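
For a per-sample view of the k=2 solution, the SilhouetteVisualizer imported at the top can be used (a minimal sketch):

# silhouette plot: one horizontal bar per sample, grouped by cluster
visualizer = SilhouetteVisualizer(KMeans(n_clusters=2, random_state=1))
visualizer.fit(subset_scaled_data)
visualizer.show()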

In [29]:
# Calculating summary statistics of the original data for each label
# (note: `kmeans` here is the last model fit in the silhouette loop above, i.e. k=9)
data['Labels'] = kmeans.labels_
mean = data.groupby('Labels').mean()
median = data.groupby('Labels').median()
df_kmeans = pd.concat([mean, median], axis=0)

# Get the number of unique cluster labels
n_clusters = data['Labels'].nunique()

# Build index labels in the same order as the concatenation above:
# all group means first, then all group medians
index_labels = [f'group_{i} Mean' for i in range(n_clusters)]
index_labels += [f'group_{i} Median' for i in range(n_clusters)]

# Set the index of the DataFrame
df_kmeans.index = index_labels

df_kmeans.T
Out[29]:
group_0 Mean group_1 Mean group_2 Mean group_3 Mean group_4 Mean group_5 Mean group_6 Mean group_7 Mean group_8 Mean group_0 Median group_1 Median group_2 Median group_3 Median group_4 Median group_5 Median group_6 Median group_7 Median group_8 Median
Income 67990.975248 42629.492152 76130.079787 29298.816981 59054.925127 30192.558065 45242.285714 80062.090909 49837.863333 69084.0 42101.0 76542.5 29791.0 59537.5 29510.5 38998.0 77917.5 50637.5
Kidhome 0.148515 0.652466 0.026596 0.841509 0.040609 0.870968 0.666667 0.015152 0.946667 0.0 1.0 0.0 1.0 0.0 1.0 1.0 0.0 1.0
Teenhome 0.495050 0.959641 0.202128 0.067925 0.979695 0.022581 0.523810 0.026515 0.926667 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0
Recency 48.064356 50.742152 48.101064 22.105660 48.530457 70.961290 53.047619 50.787879 47.486667 47.5 50.0 48.0 19.0 49.0 74.0 49.0 55.0 51.0
MntWines 631.841584 56.513453 528.436170 28.433962 494.076142 34.032258 169.000000 630.219697 306.366667 587.5 35.0 493.5 12.0 457.5 14.0 34.0 586.5 241.5
MntFruits 55.321782 3.352018 117.829787 6.279245 18.667513 6.932258 24.190476 40.064394 12.280000 48.5 1.0 120.0 3.0 12.0 3.0 6.0 31.0 6.0
MntMeatProducts 268.386139 20.224215 463.351064 21.596226 127.385787 28.625806 112.476190 532.518939 105.933333 240.0 13.5 430.0 13.0 106.0 15.5 30.0 482.5 87.0
MntFishProducts 79.400990 5.356502 144.079787 8.675472 25.329949 9.829032 25.761905 74.678030 19.733333 69.0 2.0 150.0 4.0 16.0 6.0 7.0 63.0 10.0
MntSweetProducts 74.094059 3.704036 102.053191 5.852830 16.690355 6.800000 17.523810 44.700758 16.080000 63.5 1.0 102.5 3.0 11.0 4.0 4.0 35.0 6.0
MntGoldProds 80.668317 12.668161 100.595745 18.479245 59.880711 16.464516 27.476190 58.784091 53.746667 64.5 7.0 83.0 12.0 39.0 11.0 17.0 43.0 39.0
NumDealsPurchases 2.108911 2.053812 1.335106 1.773585 2.926396 1.841935 2.333333 1.174242 7.080000 2.0 2.0 1.0 1.0 3.0 1.5 2.0 1.0 6.0
NumWebPurchases 7.747525 1.997758 4.978723 2.052830 6.218274 2.296774 3.619048 4.193182 5.793333 8.0 2.0 5.0 2.0 6.0 2.0 3.0 4.0 6.0
NumCatalogPurchases 4.915842 0.594170 5.957447 0.528302 3.149746 0.496774 2.047619 6.352273 2.200000 4.0 0.0 6.0 0.0 3.0 0.0 1.0 6.0 2.0
NumStorePurchases 8.564356 3.459641 8.414894 2.992453 8.012690 3.209677 5.238095 8.234848 5.906667 9.0 3.0 8.0 3.0 8.0 3.0 3.0 8.0 6.0
NumWebVisitsMonth 5.272277 5.616592 2.462766 6.924528 5.355330 6.906452 5.809524 2.117424 7.393333 5.0 6.0 2.0 7.0 5.0 7.0 7.0 2.0 7.0
Complain 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
Response 0.331683 0.015695 0.196809 0.290566 0.068528 0.000000 0.142857 0.284091 0.273333 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Age 56.019802 61.937220 55.829787 47.203774 61.974619 47.090323 59.904762 57.204545 57.026667 55.0 61.0 54.0 47.0 62.0 47.0 61.0 58.0 55.0
Observations:¶
  • Note that these nine groups come from the last model fit in the silhouette loop above (k = 9), not from the chosen k = 2 solution; the final two-cluster profiling is performed below.

Question 9: Do a final fit with the appropriate number of clusters. How much total time does it take for the model to fit the data?¶

In [30]:
%%time
kmeans = KMeans(n_clusters=2, random_state=0)  # final fit with the chosen number of clusters
kmeans.fit(k_means_data)
Out[30]:
KMeans(n_clusters=2, random_state=0)
In [31]:
import time

start_time = time.time()

kmeans = KMeans(n_clusters=2, random_state=0)
kmeans.fit(k_means_data)

end_time = time.time()
total_time = end_time - start_time

print(f"Total time to fit the model: {total_time:.4f} seconds")
Total time to fit the model: 0.0137 seconds
In [32]:
# creating a copy of the original data
data1 = data.copy()

# adding kmeans cluster labels to the original and scaled dataframes
k_means_data["K_means_segments"] = kmeans.labels_
data1["K_means_segments"] = kmeans.labels_

Hierarchical Clustering¶

In [33]:
hc_data = subset_scaled_data.copy()

Question 10: Calculate the cophenetic correlation for every combination of distance metric and linkage method. Which combination has the highest cophenetic correlation?¶

In [34]:
# list of distance metrics
distance_metrics = ["euclidean", "chebyshev", "mahalanobis", "cityblock"]

# list of linkage methods
linkage_methods = ["single", "complete", "average", "weighted"]

high_cophenet_corr = 0
high_dm_lm = [0, 0]

for dm in distance_metrics:
    for lm in linkage_methods:
        Z = linkage(hc_data, metric=dm, method=lm)  # linkage matrix for this metric/linkage pair
        c, coph_dists = cophenet(Z, pdist(hc_data))
        print(
            "Cophenetic correlation for {} distance and {} linkage is {}.".format(
                dm.capitalize(), lm, c
            )
        )
        if high_cophenet_corr < c:
            high_cophenet_corr = c
            high_dm_lm[0] = dm
            high_dm_lm[1] = lm
Cophenetic correlation for Euclidean distance and single linkage is 0.7533004792313172.
Cophenetic correlation for Euclidean distance and complete linkage is 0.73612253810357.
Cophenetic correlation for Euclidean distance and average linkage is 0.8541236643208112.
Cophenetic correlation for Euclidean distance and weighted linkage is 0.8108803552015535.
Cophenetic correlation for Chebyshev distance and single linkage is 0.6556214794929751.
Cophenetic correlation for Chebyshev distance and complete linkage is 0.6759851614442833.
Cophenetic correlation for Chebyshev distance and average linkage is 0.7695330133310652.
Cophenetic correlation for Chebyshev distance and weighted linkage is 0.7193948146606922.
Cophenetic correlation for Mahalanobis distance and single linkage is 0.7820706370993079.
Cophenetic correlation for Mahalanobis distance and complete linkage is 0.6843897408011711.
Cophenetic correlation for Mahalanobis distance and average linkage is 0.8224887488724775.
Cophenetic correlation for Mahalanobis distance and weighted linkage is 0.6999145308941309.
Cophenetic correlation for Cityblock distance and single linkage is 0.8060129075261067.
Cophenetic correlation for Cityblock distance and complete linkage is 0.5280471519009835.
Cophenetic correlation for Cityblock distance and average linkage is 0.7915031255045084.
Cophenetic correlation for Cityblock distance and weighted linkage is 0.7649520275546621.
In [35]:
# printing the combination of distance metric and linkage method with the highest cophenetic correlation
print(
    "Highest cophenetic correlation is {}, which is obtained with {} distance and {} linkage.".format(
        high_cophenet_corr, high_dm_lm[0].capitalize(), high_dm_lm[1]
    )
)
Highest cophenetic correlation is 0.8541236643208112, which is obtained with Euclidean distance and average linkage.
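
Ward linkage requires Euclidean distance, so it does not appear in the grid above; for completeness, its cophenetic correlation can be computed separately (a sketch):

# cophenetic correlation for Ward linkage with Euclidean distance
Z_ward = linkage(hc_data, metric="euclidean", method="ward")
c_ward, _ = cophenet(Z_ward, pdist(hc_data))
print(f"Cophenetic correlation for Ward linkage: {c_ward:.4f}")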

Question 11: Plot the dendrogram for every linkage method with "Euclidean" distance only. What should be the appropriate linkage according to the plot?¶

Let's view the dendrograms for the different linkage methods.

In [36]:
# list of linkage methods
linkage_methods = ["single", "complete", "average", "centroid", "ward", "weighted"]


# to create a subplot image
fig, axs = plt.subplots(len(linkage_methods), 1, figsize=(15, 30))

# We will enumerate through the list of linkage methods above
# For each linkage method, we will plot the dendrogram and calculate the cophenetic correlation
for i, method in enumerate(linkage_methods):
    # Calculating the linkage with Euclidean distance and the current linkage method
    Z = linkage(hc_data, metric="euclidean", method=method)

    # Visualizing the Dendrogram with the calculated linkage matrix Z
    dendrogram(Z, ax=axs[i])

    axs[i].set_title(f"Dendrogram ({method.capitalize()} Linkage)")

    coph_corr, coph_dist = cophenet(Z, pdist(hc_data))
    axs[i].annotate(
        f"Cophenetic\nCorrelation\n{coph_corr:0.2f}",
        (0.80, 0.80),
        xycoords="axes fraction",
    )
[Figure: dendrograms for single, complete, average, centroid, ward, and weighted linkages, each annotated with its cophenetic correlation]
Observations:¶
  • The Ward linkage appears to be the most promising method for clustering with Euclidean distance, even though its cophenetic correlation (0.47) is the lowest of the six methods.

  • It shows a clear separation of clusters with distinct vertical lines, indicating well-defined clusters.
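
The Ward dendrogram can also be cut into flat clusters directly with SciPy's fcluster (a sketch, cutting into two clusters for illustration):

from scipy.cluster.hierarchy import fcluster

# cut the Ward tree into exactly 2 flat clusters and check the segment sizes
Z_ward = linkage(hc_data, metric="euclidean", method="ward")
ward_labels = fcluster(Z_ward, t=2, criterion="maxclust")
print(pd.Series(ward_labels).value_counts())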

Question 12: Check the silhouette score for the hierarchical clustering. What should be the appropriate number of clusters according to this plot?¶

In [37]:
sil_score_hc = []
cluster_list = list(range(2, 10))
for n_clusters in cluster_list:
    # Initialize the model with the current number of clusters from cluster_list
    clusterer = AgglomerativeClustering(n_clusters=n_clusters)
    # Fit the model on the scaled data (hc_data) and get predictions
    preds = clusterer.fit_predict(hc_data)
    # Calculate the silhouette score using hc_data and the predictions
    score = silhouette_score(hc_data, preds)
    sil_score_hc.append(score)
    print("For n_clusters = {}, silhouette score is {}".format(n_clusters, score))
For n_clusters = 2, silhouette score is 0.2403229059475717
For n_clusters = 3, silhouette score is 0.19529185338217378
For n_clusters = 4, silhouette score is 0.20968348277361193
For n_clusters = 5, silhouette score is 0.1263514012994975
For n_clusters = 6, silhouette score is 0.1248134339371427
For n_clusters = 7, silhouette score is 0.12760383576238057
For n_clusters = 8, silhouette score is 0.14095210062230012
For n_clusters = 9, silhouette score is 0.14158008628847135
Observations:¶
  • The highest silhouette score is 0.2403, which is achieved when n_clusters = 2. Forming 2 clusters might be the most appropriate choice based on the silhouette score.

Question 13: Fit the hierarchical clustering model with the appropriate parameters finalized above. How much time does it take to fit the model?¶

In [38]:
%%time
HCmodel = AgglomerativeClustering(n_clusters=2, metric="euclidean", linkage="ward") # Initialize the HC model with appropriate parameters.
HCmodel.fit(hc_data)
CPU times: user 280 ms, sys: 25.9 ms, total: 306 ms
Wall time: 298 ms
Out[38]:
AgglomerativeClustering()

Observations

  • It takes roughly 300 ms of CPU time (298 ms wall time) to fit the model.
In [39]:
# creating a copy of the original data
data2 = data.copy()

# adding hierarchical cluster labels to the original and scaled dataframes
hc_data["HC_segments"] = HCmodel.labels_
data2["HC_segments"] = HCmodel.labels_
In [40]:
hc_data.head()
Out[40]:
Income Kidhome Teenhome Recency MntWines MntFruits MntMeatProducts MntFishProducts MntSweetProducts MntGoldProds NumDealsPurchases NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth Complain Response Age HC_segments
0 0.235696 -0.825218 -0.929894 0.307039 0.983781 1.551577 1.679702 2.462147 1.476500 0.843207 0.349414 1.409304 2.510890 -0.550785 0.693904 -0.097282 2.388846 0.985345 0
1 -0.235454 1.032559 0.906934 -0.383664 -0.870479 -0.636301 -0.713225 -0.650449 -0.631503 -0.729006 -0.168236 -1.110409 -0.568720 -1.166125 -0.130463 -0.097282 -0.418612 1.235733 1
2 0.773999 -0.825218 -0.929894 -0.798086 0.362723 0.570804 -0.177032 1.345274 -0.146905 -0.038766 -0.685887 1.409304 -0.226541 1.295237 -0.542647 -0.097282 -0.418612 0.317643 0
3 -1.022355 1.032559 -0.929894 -0.798086 -0.870479 -0.560857 -0.651187 -0.503974 -0.583043 -0.748179 -0.168236 -0.750450 -0.910898 -0.550785 0.281720 -0.097282 -0.418612 -1.268149 1
4 0.241888 1.032559 -0.929894 1.550305 -0.389085 0.419916 -0.216914 0.155164 -0.001525 -0.556446 1.384715 0.329427 0.115638 0.064556 -0.130463 -0.097282 -0.418612 -1.017761 1
In [41]:
data2.head()
Out[41]:
Income Kidhome Teenhome Recency MntWines MntFruits MntMeatProducts MntFishProducts MntSweetProducts MntGoldProds NumDealsPurchases NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth Complain Response Age Labels HC_segments
0 58138.0 0 0 58 635 88 546 172 88 88 3 8 10 4 7 0 1 68 0 0
1 46344.0 1 1 38 11 1 6 2 1 6 2 1 1 2 5 0 0 71 1 1
2 71613.0 0 0 26 426 49 127 111 21 42 1 8 2 10 4 0 0 60 0 0
3 26646.0 1 0 26 11 4 20 10 3 5 2 2 0 4 6 0 0 41 3 1
4 58293.0 1 0 94 173 43 118 46 27 15 5 5 3 6 5 0 0 44 5 1
In [42]:
subset_scaled_data["HC_Clusters"] = HCmodel.labels_
data["HC_Clusters"] = HCmodel.labels_

Cluster Profiling and Comparison¶

K-Means Clustering vs Hierarchical Clustering Comparison¶

Question 14: Perform and compare cluster profiling on both algorithms using boxplots. Based on all the observations, which one provides better clustering?¶

In [43]:
plt.figure(figsize=(20, 20))
plt.suptitle("Boxplot of numerical variables for each cluster in Kmeans Clustering")

# Get the number of numerical columns to plot
num_cols = len(data1.select_dtypes(include=['number']).columns)
num_rows = (num_cols + 2) // 3  # Calculate rows for subplots

# Iterate over numerical variables and create boxplots
for i, variable in enumerate(data1.select_dtypes(include=['number']).columns):
    if variable != "K_means_segments":  # Skip the cluster label column itself
        plt.subplot(num_rows, 3, i + 1)
        # Filter data to include only the desired cluster labels (0 and 1)
        filtered_data = data1[data1['K_means_segments'].isin([0, 1])]
        sns.boxplot(data=filtered_data, x="K_means_segments", y=variable, palette='Spectral')

plt.tight_layout(pad=2.0)
plt.show()
[Figure: cluster-wise boxplots of the numerical variables for the K-Means segments]
In [44]:
plt.figure(figsize=(20, 20))
plt.suptitle("Boxplot of numerical variables for each cluster in Hierarchial Clustering")

# Get the number of numerical columns to plot
num_cols = len(data2.select_dtypes(include=['number']).columns)
num_rows = (num_cols + 2) // 3  # Calculate rows for subplots

# Iterate over numerical variables and create boxplots
for i, variable in enumerate(data2.select_dtypes(include=['number']).columns):
    if variable != "HC_segments":  # Skip the cluster label column itself
        plt.subplot(num_rows, 3, i + 1)
        # Filter data to include only the desired cluster labels (0 and 1)
        filtered_data = data2[data2['HC_segments'].isin([0, 1])]
        sns.boxplot(data=filtered_data, x="HC_segments", y=variable, palette='Spectral')

plt.tight_layout(pad=2.0)
plt.show()
[Figure: cluster-wise boxplots of the numerical variables for the hierarchical segments]
Observations:¶

K-Means clustering shows better separation on several variables, whereas hierarchical clustering produces overlapping distributions in multiple cases.

K-Means is therefore the preferred choice for clear segmentation with well-separated clusters.
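
To quantify how much the two labelings actually agree, the adjusted Rand index can be computed between them (a sketch; ARI is permutation-invariant, so it does not matter which cluster is numbered 0 or 1):

from sklearn.metrics import adjusted_rand_score

# agreement between the K-Means and hierarchical labels (1.0 = identical partitions)
ari = adjusted_rand_score(data1["K_means_segments"], data2["HC_segments"])
print(f"Adjusted Rand Index between K-Means and HC segments: {ari:.3f}")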

Question 15: Perform cluster profiling on the data with the appropriate algorithm determined above using barplots. What observations can be derived for each cluster from this plot?¶

In [45]:
plt.figure(figsize=(20, 20))
plt.suptitle("Barplots of all variables for each cluster")

# Filter data to include only the desired cluster labels (0 and 1)
filtered_data = data1[data1['K_means_segments'].isin([0, 1])]

# Get the number of numerical columns to plot
num_cols = len(filtered_data.select_dtypes(include=['number']).columns)
num_rows = (num_cols + 2) // 3  # Calculate rows for subplots

# Iterate over numerical variables and create barplots
for i, variable in enumerate(filtered_data.select_dtypes(include=['number']).columns):
    if variable != "K_means_segments":  # Skip the cluster label column itself
        plt.subplot(num_rows, 3, i + 1)
        sns.barplot(data=filtered_data, x="K_means_segments", y=variable, palette='Spectral', errorbar=None)

plt.tight_layout(pad=2.0)
plt.show()
[Figure: cluster-wise barplots of the numerical variables for the K-Means segments]
Observations:¶
  • Income: Cluster 1 has a much higher mean income than Cluster 0 (about 71.7K vs. 38.8K).
  • Recency: Mean recency is nearly identical across the two clusters (about 49 days), so recency does not differentiate the segments.
  • Product Spending: Cluster 1 spends far more across every product category, most notably wines and meat products.
  • Purchasing Behaviour: Cluster 1 makes more purchases through every channel (web, catalog, and store), while Cluster 0 visits the website more often but buys less and relies more on discounted deals.
  • Kidhome and Teenhome: Kidhome differs markedly (about 0.71 vs. 0.06 children per household), whereas Teenhome is similar across clusters; children at home appear to be a stronger driver of the segmentation than teenagers.
In [46]:
# lets display cluster profile

# Assuming data1 contains cluster labels and relevant features
cluster_profile = data1.groupby('K_means_segments').mean()

# Style the output
cluster_profile.style.highlight_max(color="green", axis=0)
Out[46]:
  Income Kidhome Teenhome Recency MntWines MntFruits MntMeatProducts MntFishProducts MntSweetProducts MntGoldProds NumDealsPurchases NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth Complain Response Age Labels
K_means_segments                                      
0 38840.161643 0.706104 0.555388 48.812359 103.334589 6.301432 36.353429 9.343632 6.252449 22.100226 2.560663 2.920121 0.864356 3.914846 6.444612 0.010550 0.096458 54.876413 3.445365
1 71711.030120 0.063527 0.434830 49.541073 595.499452 55.372399 356.765608 78.486309 57.309967 75.883899 1.982475 5.777656 5.274918 8.515882 3.676889 0.007667 0.225630 58.109529 3.663746
In [47]:
# lets display cluster profile

# Assuming data2 contains hierarchical cluster labels and relevant features
cluster_profile = data2.groupby('HC_segments').mean()  # Use data2 and 'HC_segments'

# Style the output
cluster_profile.style.highlight_max(color="green", axis=0)
Out[47]:
  Income Kidhome Teenhome Recency MntWines MntFruits MntMeatProducts MntFishProducts MntSweetProducts MntGoldProds NumDealsPurchases NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth Complain Response Age Labels
HC_segments                                      
0 67501.717257 0.132743 0.553097 48.497345 541.935398 46.183186 301.081416 66.417699 47.736283 70.505310 2.571681 5.804425 4.598230 8.072566 4.323894 0.000000 0.202655 58.578761 3.965487
1 36699.211261 0.761261 0.458559 49.732432 61.647748 6.063063 30.401802 8.112613 6.017117 17.061261 2.073874 2.334234 0.690991 3.466667 6.327027 0.018919 0.094595 53.766667 3.095495
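
Before drawing conclusions, it is worth checking how many customers each algorithm assigns to each segment (a quick sketch):

# segment sizes under each algorithm
print(data1["K_means_segments"].value_counts())
print(data2["HC_segments"].value_counts())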

Actionable Insights and Recommendations¶

K-Means Segments

  • Cluster 0 represents lower-income, budget-conscious shoppers: moderate age (around 55 years), more children at home, and lower spending across all categories. They browse the website often but purchase less through every channel, and they are less responsive to promotions.

  • Cluster 1 represents wealthier, high-spending customers: fewer children at home, significantly higher spending on all product categories, and more purchases across web, catalog, and store channels. They are more engaged with promotions and less likely to complain.

Hierarchical Segments

  • Both algorithms separate a clear high-income group from a low-income group; the income gap is similar in the two solutions (USD 67.5K vs. USD 36.7K in HC; USD 71.7K vs. USD 38.8K in K-Means).

  • The HC low-income segment is the more extreme one: its members spend noticeably less than the K-Means low-income segment (for example, a mean wine spend of about 62 vs. 103), so in relative terms the high-to-low spending contrast on wine, meat, and gold products is sharper in HC.

  • In both solutions, low-income consumers visit the website more often but purchase less.

  • K-Means produces the more balanced segmentation: its low-income segment (Cluster 0) still spends on some products, while the HC low-income segment spends significantly less.

  • Response to promotions is strongly cluster-dependent under both algorithms (20.3% vs. 9.5% in HC; 22.6% vs. 9.6% in K-Means).

Business Recommendations

  • Based on different purchasing behaviours and preferences, marketing campaigns could be targeted and tailored to each segment's needs.
  • For high-value customers, implement customer retention strategies to foster long-term relationships with this segment. For example, provide personalized customer service to enhance their experience and encourage repeat orders.
  • Enhance the in-store experience for customers who prefer to shop in person.
  • For customers who prefer to shop online, invest in online and catalog marketing to reach them and provide a seamless shopping journey across these channels.